🚨🚨 Refactor Image Processors to support different backends #43514
yonigozlan merged 59 commits into huggingface:main
Conversation
ArthurZucker
left a comment
this is going in a great direction!
I think that having the image_processor_xxxx in general calling self.resize works well as it would fetch the backend's method.
The thing I am not seeing right now is, for example, how someone would go about adding a new ImageProcessingLlavaNext but with, say, mlx processing.
They have to create a class that inherits from their custom mixin, and then there needs to be a way to automatically make sure that MlxImageProcessingLlavaNext is the class that is going to be used when requesting the mlx backend.
If we are able to take that into account we should be fairly ready!
Otherwise very nice for now!
```diff
         `bool`: Whether or not this image processor is using the fast (TorchVision) backend.
         """
-        return False
+        return self.backend == "torchvision"
```
this attribute should be removed imo. Numpy can be faster in some cases and it does not represent anything anymore
added a deprecation cycle as I think it's used by downstream libraries
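As an illustration, a deprecation cycle for `is_fast` could look roughly like the sketch below (hypothetical stand-in class, not the actual implementation; the `backend` attribute is assumed to hold the backend name as in this PR):

```python
import warnings


class ProcessorSketch:
    """Hypothetical stand-in for BaseImageProcessor, only to illustrate the cycle."""

    def __init__(self, backend="torchvision"):
        self.backend = backend

    @property
    def is_fast(self) -> bool:
        # Keep the old attribute working for downstream libraries,
        # but point users at the replacement check.
        warnings.warn(
            '`is_fast` is deprecated; check `processor.backend == "torchvision"` instead.',
            FutureWarning,
        )
        return self.backend == "torchvision"
```

Downstream code reading `is_fast` keeps working during the cycle, and now reflects the actual backend instead of a hardcoded value.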
```python
# Backend availability checkers: maps backend names to functions that check availability
_backend_availability_checks = {
    "torchvision": is_torchvision_available,
    "python": lambda: True,  # Python backend is always available
```
| "python": lambda: True, # Python backend is always available | |
| "numpy": lambda: True, # Python backend is always available |
It relies on numpy no? (just saying the name should probably be different)
Yes, it's a bit misleading, but the vision operations are handled by PIL (with numpy arrays as inputs/outputs), so maybe naming the backend "pil" is better? Plus it makes it more explicit that PIL is a required dependency for this backend.
Thanks @ArthurZucker !
```python
# In image_processing_utils.py - create the generic MLX backend if it doesn't already exist
class MlxBackend(ImageProcessingBackend):
    def resize(self, image, size, **kwargs):
        # generic MLX resize
        pass
    # ... other generic MLX methods

# In llava_next/image_processing_llava_next.py - inherit from it
class LlavaNextMlxBackend(MlxBackend):
    def preprocess(self, images, image_grid_pinpoints, **kwargs):
        # LlavaNext-specific patch processing with MLX
        pass

class LlavaNextImageProcessor(BaseImageProcessor):
    _backend_classes = {
        "torchvision": LlavaNextTorchVisionBackend,
        "python": LlavaNextPythonBackend,
        "mlx": LlavaNextMlxBackend,
    }
    _backend_availability_checks = {
        "torchvision": is_torchvision_available,
        "python": lambda: True,
        "mlx": is_mlx_available,
    }
```
```python
from transformers import ImageProcessingBackend, LlavaNextImageProcessor

# No need for users to add both an MLX mixin and an inherited LlavaNextMlxBackend,
# just overwrite the necessary methods directly in LlavaNextMlxBackend
class LlavaNextMlxBackend(ImageProcessingBackend):
    def resize(self, image, size, **kwargs):
        # your MLX implementation
        pass
    # ... implement other methods

LlavaNextImageProcessor.register_backend(
    name="mlx",
    backend_class=LlavaNextMlxBackend,
    availability_check=lambda: is_mlx_available(),  # optional
)

processor = LlavaNextImageProcessor(backend="mlx")
```

Then instantiate like this: `processor = LlavaNextImageProcessor.from_pretrained("llava-hf/llama3-llava-next-8b-hf", backend="mlx")`
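For what it's worth, a minimal sketch of what such a `register_backend` classmethod could do under the hood — the dict names are taken from the example above; everything else is illustrative and may differ from the actual implementation:

```python
class BaseProcessorSketch:
    """Toy stand-in for BaseImageProcessor, just to show the registration mechanics."""

    _backend_classes = {}
    _backend_availability_checks = {}

    @classmethod
    def register_backend(cls, name, backend_class, availability_check=None):
        # Rebuild the dicts on the subclass so registration does not mutate
        # a dict shared with the parent class.
        cls._backend_classes = {**cls._backend_classes, name: backend_class}
        cls._backend_availability_checks = {
            **cls._backend_availability_checks,
            name: availability_check or (lambda: True),
        }


class LlavaNextProcessorSketch(BaseProcessorSketch):
    _backend_classes = {"torchvision": object, "pil": object}


class MlxBackendSketch:
    pass


# Registering adds the backend to the subclass without touching the base class.
LlavaNextProcessorSketch.register_backend("mlx", MlxBackendSketch)
```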
```python
@requires(backends=("vision",))
@lru_cache(maxsize=10)
def validate_fast_preprocess_arguments(
```
what is the fast sense here?
None, needs to be renamed/modified 😁
| "pil": MyPilBackend, | ||
| } | ||
|
|
||
| To add a new backend, extend both `_backend_classes` and `_backend_availability_checks`: |
let's rather push for `register`?
```python
resample = None
image_mean = None
image_std = None
size = None
default_to_square = True
crop_size = None
do_resize = None
do_center_crop = None
do_pad = None
pad_size = None
do_rescale = None
rescale_factor = 1 / 255
do_normalize = None
do_convert_rgb = None
return_tensors = None
data_format = ChannelDimension.FIRST
input_data_format = None
device = None
model_input_names = ["pixel_values"]
image_seq_length = None
```
i really don't understand why you have these when you also have `ImageKwargs`? does it not defeat the point?
```python
Update kwargs that need further processing before being validated.
Can be overridden by subclasses to customize the processing of kwargs.
"""
```
this function looks very weird.... but okay
```python
# Extract parameters that are only used for preparing the input images
do_convert_rgb = kwargs.pop("do_convert_rgb")
input_data_format = kwargs.pop("input_data_format")
device = kwargs.pop("device")
```
this is weird as well, I don't get why they can't fall through the rest normally
| """ | ||
| Preprocess an image or a batch of images. | ||
| """ | ||
| validate_kwargs(captured_kwargs=kwargs.keys(), valid_processor_keys=self._valid_kwargs_names) |
why do we have so many validation steps? `validate_kwargs`, which are typed dicts, then validate the typed dict, then set defaults, then further processing, then validate the processed kwargs.
It "looks" mega bloated
ArthurZucker
left a comment
much better / simpler imo!
```python
    return BatchFeature(data={"pixel_values": processed_images}, tensor_type=return_tensors)


class PilBackend(BaseImageProcessor):
```
not a super strong opinion but I would probably split this into different files!
```python
For processors that only need standard operations (resize, center crop, rescale, normalize), define class
attributes:

class MyImageProcessor(BaseImageProcessor):
```
```diff
-class MyImageProcessor(BaseImageProcessor):
+class MyImageProcessor(PilBackend):
```
IDK I might be wrong!
Yep sorry the docstrings were out of date!
```python
class MyImageProcessor(BaseImageProcessor):
    _backend_classes = {
        "torchvision": MyTorchVisionBackend,
        "pil": MyPilBackend,
    }
```
this is not valid anymore but you probably did not have the time to update it
```python
validate_typed_dict(self.valid_kwargs, kwargs)

# Set default kwargs from self
for kwarg_name in self._valid_kwargs_names:
    kwargs.setdefault(kwarg_name, getattr(self, kwarg_name, None))

# Update kwargs that need further processing before being validated
kwargs = self._standardize_kwargs(**kwargs)

# Validate kwargs
print("kwargs: ", kwargs)
self._validate_preprocess_kwargs(**kwargs)

return self._preprocess_image_like_inputs(images, *args, **kwargs)
```
still the same comment, but it's fine to address later / it looks a bit simpler!
```python
if isinstance(image_processor_mapping, (list, tuple)):
    pil_class, torchvision_class = image_processor_mapping
    image_processor_mapping = {"pil": pil_class, "torchvision": torchvision_class}
```
not 100% sure when that would happen?
maybe if we update register to support tuple (code that would already be there) then we won't need this?
here I don't get it, `type(config)` exists, why do we create `image_processor_mapping`? when it should already be correct?
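If `register` were taught to accept the legacy tuple directly, the normalization above could live in one small helper; a sketch (helper name hypothetical):

```python
def normalize_image_processor_mapping(mapping):
    """Accept either the legacy (pil_class, torchvision_class) tuple or the
    new {"pil": ..., "torchvision": ...} dict, and return the dict form."""
    if isinstance(mapping, (list, tuple)):
        pil_class, torchvision_class = mapping
        return {"pil": pil_class, "torchvision": torchvision_class}
    return dict(mapping)
```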
```python
    do_reduce_labels: bool = False,
    **kwargs,
) -> None:
    resample = PILImageResampling.BICUBIC
```
I am seeing `TorchVisionBackend` but then `PILImageResampling` with PIL, weird but I guess it's just an enum
[For maintainers] Suggested jobs to run (before merge) run-slow: auto, beit, bit, blip, bridgetower, chameleon, chinese_clip |
…image processors (directly in modular model converter)
The elif branch for URL detection (is_remote_url + download_url) was accidentally removed in huggingface#43514 during the image processor refactor. This restores URL support with a local download_url helper using httpx, since the old utils.hub.download_url was intentionally dropped in v5. Fixes huggingface#44821
* fix * check * revert --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
…r backend refactor (#45258) * Fix SmolVLM video processor resize using wrong interpolation after image processor backend refactor The PR #43514 refactored _preprocess to pass resample=resample to resize, but resize still accepted interpolation as its parameter. The resample kwarg was silently swallowed by **kwargs, causing interpolation to default to BILINEAR instead of the intended LANCZOS->BICUBIC path, producing ~0.36 difference in pixel_values. Fix by renaming the parameter to resample and converting PIL resample integers to torchvision InterpolationMode via pil_torch_interpolation_mapping, matching the pattern used in TorchvisionBackend.resize. * fix --------- Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
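The failure mode these commits describe — a renamed keyword silently swallowed by `**kwargs` — is easy to reproduce in isolation (toy functions, not the actual transformers code):

```python
# Toy version of the bug: the caller passes `resample`, but the callee
# still names its parameter `interpolation`, so `resample` disappears
# into **kwargs and the default interpolation is silently used.
def resize_buggy(image, interpolation="bilinear", **kwargs):
    return interpolation


# Toy version of the fix: rename the parameter so the caller's kwarg
# is actually picked up (the real fix also maps PIL resample integers
# to torchvision InterpolationMode via pil_torch_interpolation_mapping).
def resize_fixed(image, resample="bilinear", **kwargs):
    return resample
```

Calling `resize_buggy(img, resample="bicubic")` silently returns `"bilinear"`, which is exactly the class of bug the commits fix.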
Image Processor Backend Refactor
Summary
Replaces the dual-file `BaseImageProcessor` (slow/PIL) + `BaseImageProcessorFast` (fast/torchvision) design with a unified backend architecture. The `image_processing_utils_fast` module is removed; all logic lives in `image_processing_utils` and `image_processing_backends`.

New Structure
Base classes: `BaseImageProcessor` in `image_processing_utils` defines the shared preprocessing pipeline (kwargs validation, input preparation, dispatching to backends). The built-in backend classes live in a separate file, `image_processing_backends.py`:

- `TorchvisionBackend`: GPU-accelerated, batched operations on `torch.Tensor`, channels-first
- `PilBackend`: portable, CPU-only, operations on `np.ndarray`, channels-first

Each backend implements `process_image` (convert raw input to backend format) and `_preprocess` (batch operations). Model-specific processors inherit from one of these backends.

File layout: Per model, `image_processing_<model>.py` holds the torchvision backend (the default) and `image_processing_pil_<model>.py` the PIL backend when both exist. The no-suffix class is now torchvision (the opposite of the old `*Fast` convention).

Shared pipeline: Both backends use the same `preprocess` flow: validate kwargs → standardize (size, crop_size, pad_size, resample) → prepare inputs via `process_image` → run `_preprocess`. Torchvision batches by shape for efficiency; PIL processes images one by one.

Loading Paths & Fallback Logic
`AutoImageProcessor.from_pretrained`: Config resolution order: image processor config → nested processor config → model config. Class resolution uses `image_processor_type` or `auto_map["AutoImageProcessor"]`, with a fallback from legacy `feature_extractor_type`/`AutoFeatureExtractor`.

Backend resolution: A new `backend` parameter replaces `use_fast`. Resolution order: (1) deprecated `use_fast` → converted to a backend with a warning; (2) explicit `backend` → used as-is; (3) default: `"pil"` for Lanczos models (Chameleon, Flava, Idefics3, SmolVLM); otherwise `"torchvision"` if available, else `"pil"`.

Mapping format: `IMAGE_PROCESSOR_MAPPING_NAMES` entries are now `{"torchvision": "ClassName", "pil": "ClassNamePil"}` dicts instead of `(slow, fast)` tuples. Models may expose one or both backends.

Fallback when backend unavailable: `_load_class_with_fallback` tries the requested backend first, then other backends in the mapping. If torchvision is requested but unavailable, it falls back to PIL with a warning.

Registering New / Custom Backends
`AutoImageProcessor.register()`: Registers image processor classes for a given config. The preferred API is `image_processor_classes={"backend_name": ProcessorClass}`. You can register one or more backends per model type.

Custom backends: The backend key space is open: any string (e.g. `"torchvision"`, `"pil"`, `"mlx"`, `"onnx"`, etc.) can be used. Each processor class must inherit from `BaseImageProcessor` and implement `process_image` and `_preprocess`. Users select a backend via `AutoImageProcessor.from_pretrained(..., backend="custom")`. The same fallback logic applies: if the requested backend is unavailable (e.g. missing deps), loading tries other backends in the mapping.

Legacy params: `slow_image_processor_class` and `fast_image_processor_class` are deprecated; they are converted to `image_processor_classes={"pil": ...}` and `image_processor_classes={"torchvision": ...}` respectively.

Partial updates: When re-registering a config that already has backends, passing `image_processor_classes` merges into the existing mapping (e.g. adding a new backend without overwriting existing ones).

Backward Compatibility
- `use_fast=True/False`: Deprecation warning; converted to `backend="torchvision"`/`backend="pil"`.
- `image_processor_type: "FooImageProcessorFast"` in config: Strips the `Fast` suffix; resolves to the base class and requested backend.
- `BaseImageProcessorFast` class name: Resolves to `TorchvisionBackend`.
- `FooImageProcessorFast` via import: `_LazyModule`/`get_image_processor_class_from_name` resolves to `FooImageProcessor` when the Fast class no longer exists.
- `from transformers import FooImageProcessor` when torchvision is missing: `_LazyModule.__getattr__` transparently falls back to `FooImageProcessorPil` and warns once (`import_utils`).
- `auto_map: [slow, fast]` list: `_resolve_auto_map_class_ref` supports both the list and the new dict format.
- `slow_image_processor_class`/`fast_image_processor_class` in `register()`: Converted to the new `image_processor_classes={}` dict form.
- `is_fast` property: Deprecated; use `processor.backend == "torchvision"`.

Other Changes

- `resample`: Single parameter name; the Torchvision backend maps PIL resample to `InterpolationMode` internally.
- `SizeDict`: Used consistently in `_preprocess`; dict literals remain for class attribute defaults.
- `_set_attributes`: Centralized in `BaseImageProcessor`; backends call it in `__init__` to resolve kwargs and class defaults.
- `import_utils.BASE_FILE_REQUIREMENTS`: Still treats `image_processing*_fast.py` as torchvision-backed for the lazy import structure; legacy `_fast` filenames may remain until models are fully migrated.
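The backend resolution order described above can be sketched as a standalone function (the function name and the Lanczos-model set are taken from this description; the real logic lives in `AutoImageProcessor` and may differ in detail):

```python
import warnings

# Models that default to the PIL backend because they rely on Lanczos resampling.
_LANCZOS_MODELS = {"chameleon", "flava", "idefics3", "smolvlm"}


def resolve_backend(model_type, backend=None, use_fast=None, torchvision_available=True):
    # (1) Deprecated use_fast is converted to a backend, with a warning.
    if use_fast is not None:
        warnings.warn("`use_fast` is deprecated, use `backend` instead.", FutureWarning)
        return "torchvision" if use_fast else "pil"
    # (2) An explicit backend is used as-is.
    if backend is not None:
        return backend
    # (3) Default: PIL for Lanczos models, otherwise torchvision if available.
    if model_type in _LANCZOS_MODELS:
        return "pil"
    return "torchvision" if torchvision_available else "pil"
```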
resample: Single parameter name; Torchvision backend maps PIL resample toInterpolationModeinternally.SizeDict: Used consistently in_preprocess; dict literals remain for class attribute defaults._set_attributes: Centralized inBaseImageProcessor; backends call it in__init__to resolve kwargs and class defaults.import_utils.BASE_FILE_REQUIREMENTS: Still treatsimage_processing*_fast.pyas torchvision-backed for lazy import structure; legacy_fastfilenames may remain until models are fully migrated.